METHOD AND APPARATUS FOR COMPRESSING AN INPUT STRING TO
PROVIDE AN EQUIVALENT DECOMPRESSED OUTPUT STRING

Field of the Invention

The present invention relates generally to data compression techniques, and 
more particularly, to methods and apparatus for compressing an input string in a manner 
that an equivalent string relative to a noncommutation graph is produced upon 
decompression. 


Background of the Invention 

The ordering of events is fundamental to the study of the dynamic behavior
of a system. In a sequential process, it is natural to use strings of symbols over some
alphabet to specify the temporal ordering of events. The symbols may, for example,
correspond to the states, commands, or messages in a computation. J. Larus, "Whole
Program Paths," ACM SIGPLAN Conf. Prog. Lang. Des. Implem., 259-69 (May, 1999),
applies a lossless data compression algorithm known as "Sequitur" to the sequence of
events or signals determining the control flow or operations of a program's execution.
Sequitur is an example of a family of data compression algorithms known as
grammar-based codes that take a string of discrete symbols and produce a set of
hierarchical rules that rewrite the string as a context-free grammar that is capable of
generating only the string. These codes have an advantage over other compression
schemes in that they offer insights into the hierarchical structure of the original string.
J. Larus demonstrated that the grammar which is output from Sequitur can be exploited
to identify performance tuning opportunities via heavily executed subsequences of
operations.

The underlying premise in using lossless data compression for this
application is the existence of a well-defined linear ordering of events in time. A partial
ordering of events is a more accurate model for concurrent systems, such as
multiprocessor configurations, distributed systems and communication networks, which
consist of a collection of distinct processes that communicate with one another or
synchronize at times but are also partly autonomous. These complex systems permit
independence of some events occurring in the individual processes while others must
happen in a predetermined order. Noncommutation graphs are used for one model of
concurrent systems. To extend Larus' ideas to concurrent systems, a technique is
considered for compressing an input string in such a manner that a string equivalent to it
relative to a noncommutation graph is produced upon decompression.

The compression of program binaries is important for the performance of
software delivery platforms. Program binaries are files whose content must be interpreted
by a program or hardware processor that knows how the data inside the file is formatted.
M. Drinic and D. Kirovski, "PPMexe: PPM for Compressing Software," Proc. 1997 IEEE
Data Comp. Conf., 192-201 (March 2002), discloses a compression mechanism for
program binaries that explores the syntax and semantics of the program to achieve
improved compression rates. They also compress data relative to a noncommutation
graph. The disclosed compression algorithm employs the generic paradigm of prediction
by partial matching (PPM). While the disclosed compression algorithm performs well for
many applications, it introduces certain inefficiencies in terms of compression and delays.

A need therefore exists for a more efficient algorithm for compressing an 
input string given a set of equivalent words derived from a noncommutation graph. A 
further need exists for a decompression technique that reproduces a string that is 
equivalent to the original string. 

Summary of the Invention 

Generally, a method and apparatus are provided for compressing an input 
string relative to a noncommutation graph. The disclosed compression system compresses 
an input string in a manner that an equivalent string is produced upon decompression. The 
disclosed compression algorithms are based upon normal forms (i.e., a canonical 
representation of an interchange or equivalence class). Generally, the disclosed 
compression process can be decomposed into two parts. First, a normal form of the
interchange class containing the source output string is produced. Thereafter, a
grammar-based lossless data compression scheme (or another compression scheme) is
applied to the normal form. Upon decompression, the compressed string produces an
equivalent string.

A normal form generation process is employed to compute the 
lexicographic normal form or the Foata normal form of an interchange class from one of 
its members, using only a single pass over the data.

A more complete understanding of the present invention, as well as further
features and advantages of the present invention, will be obtained by reference to the
following detailed description and drawings.

Brief Description of the Drawings

FIG. 1 illustrates a compression system in which the present invention may
be employed;

FIG. 2 is a flow chart describing an exemplary implementation of the
compression process of FIG. 1;

FIG. 3 is a flow chart describing an exemplary implementation of a normal
form generation process that may be employed by the compression process of FIG. 2; and

FIG. 4 illustrates stacks for the word ddbca when the dependence relation
G is a-b-c-d.

Detailed Description

FIG. 1 illustrates a compression system 100 in which the present invention
may be employed. As shown in FIG. 1, the exemplary compression system 100 includes a
processor 110 and memory 120. According to one aspect of the invention, the
compression system 100 compresses an input string 105 in a manner that an equivalent
string 125 is produced upon decompression. As shown in FIG. 1, the memory 120
includes a compression process 200, discussed further below in conjunction with FIG. 2,
and a normal form generation process 300, discussed further below in conjunction with
FIG. 3, that may be employed by the compression process of FIG. 2.

The present invention provides compression algorithms based upon
variations of a standard notion in trace theory known as normal forms. A normal form is a
canonical representation of an interchange class. FIG. 2 is a flow chart describing an
exemplary implementation of the compression process 200 of FIG. 1. Generally, the
disclosed compression process 200 can be decomposed into two parts. As shown in FIG.
2, the compression process 200 initially produces, during step 210, a normal form of the
interchange class containing the source output string. Thereafter, the compression process
200 applies a grammar-based lossless data compression scheme (or another compression
scheme) to the normal form during step 220, before program control terminates. Upon
decompression, the compressed string produces an equivalent string.
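
By way of a non-limiting illustration, the two parts of the compression process can be
sketched in Python as follows. The names lex_normal_form, compress and decompress are
hypothetical. The normal form is found here by a naive greedy search rather than by the
single-pass process of FIG. 3, and the standard zlib codec is used only as a stand-in for
the "another compression scheme" alternative of step 220; zlib is not a grammar-based
code.

```python
import zlib


def lex_normal_form(word, edges):
    """Return the lexicographically smallest word congruent to `word`.

    `edges` lists the pairs of symbols that are adjacent in the
    noncommutation graph G, i.e. the pairs that may NOT be interchanged.
    Naive sketch: cubic time in the worst case, not a single pass.
    """
    dependent = {frozenset(e) for e in edges}
    remaining = list(word)
    out = []
    while remaining:
        best = None
        for pos, sym in enumerate(remaining):
            # sym can be brought to the front only if it commutes with
            # every symbol that precedes it in the remaining word.
            blocked = any(
                prev == sym or frozenset((prev, sym)) in dependent
                for prev in remaining[:pos]
            )
            if not blocked and (best is None or sym < remaining[best]):
                best = pos
        out.append(remaining.pop(best))
    return "".join(out)


def compress(word, edges):
    # Part 1: normal form of the interchange class; part 2: ordinary codec.
    return zlib.compress(lex_normal_form(word, edges).encode("utf-8"))


def decompress(blob):
    # Yields the normal form, which is congruent to the original input.
    return zlib.decompress(blob).decode("utf-8")


G = [("a", "b"), ("b", "c"), ("c", "d")]
print(decompress(compress("ddbca", G)))  # baddc
```

Under these assumptions the decompressed word baddc differs from the input ddbca but
lies in the same interchange class, which is all that the compression system 100 is
required to reproduce.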

The 1978 Lempel-Ziv data compression scheme (LZ '78), described, for
example, in J. Ziv and A. Lempel, "Compression of Individual Sequences Via
Variable-Rate Coding," IEEE Trans. Inform. Theory IT-24, 530-36 (1978), can be viewed
as an example of a grammar-based code. LZ '78 asymptotically compresses the output of an
ergodic source to the source entropy with probability 1. J.C. Kieffer and E.-H. Yang,
"Grammar-Based Codes: A New Class of Universal Lossless Source Codes," IEEE Trans.
Inform. Theory 46, 737-54 (2000), defines the notion of an irreducible grammar transform
and demonstrates that any grammar-based code that uses an irreducible grammar
transform is also universal in the sense that it almost surely asymptotically compresses the
output of an ergodic source to the source entropy. In the illustrative embodiments
described herein, any universal grammar-based lossless data compression scheme may be
employed. While it is unknown if Sequitur is a universal compression technique, J.C.
Kieffer and E.-H. Yang offer a modification of Sequitur that is provably universal.

Two examples are discussed for which the codes of the present invention
attain a new graph entropy referred to herein as the interchange entropy. In both cases, it
is assumed for simplicity that the original source string was the output of a discrete,
memoryless source; the analysis can be extended to finite-state, unifilar Markov sources,
as would be apparent to a person of ordinary skill. In one instance, the dependence
relation on the source alphabet is a complete k-partite graph and in the other case, the
noncommutation graph contains at least one vertex which is adjacent to all others. For a
further discussion of interchange entropy, see S.A. Savari, "Concurrent Processes and the
Interchange Entropy," Proc. of IEEE International Symposium on Information Theory,
(Yokohama, Japan, July 2003); S.A. Savari, "On Compressing Interchange Classes of
Events in a Concurrent System," Proc. of IEEE Data Compression Conference, (Snowbird,
Utah, March 2003), or S.A. Savari, "Compression of Words Over A Partially
Commutative Alphabet," Information Sciences (IS) Seminar, Cal. Tech., August 27,
2003, each incorporated by reference herein.

Dependence Relations

Trace theory is a known approach to extending the notions and results
pertaining to strings in order to treat the partial ordering of event occurrences in
concurrent systems. The idea is to combine the sequence of atomic actions observed by a
single witness of a concurrent system with a labeled and undirected dependence relation or
noncommutation graph specifying which actions can be executed independently or
concurrently. Two words with symbols over a vertex set V are congruent or equivalent
with respect to a noncommutation graph G if each can be obtained from the other through
a process of interchanging consecutive letters that are nonadjacent vertices in G. For
example, if the noncommutation graph G is given by a-b-c-d, then the two words
ddbca and bdadc are congruent since
$ddbca \equiv_G dbdca \equiv_G dbdac \equiv_G dbadc \equiv_G bdadc$.
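
The definition of congruence can be exercised directly on short words. The following
Python sketch, offered only as an illustration, enumerates an interchange class by
repeatedly interchanging consecutive letters that are nonadjacent in G. The function name
interchange_class and the edge-list encoding of G are assumptions made for this example,
and the search is practical only for small inputs.

```python
from collections import deque


def interchange_class(word, edges):
    """Enumerate every word congruent to `word` (small inputs only)."""
    dependent = {frozenset(e) for e in edges}
    seen = {word}
    queue = deque([word])
    while queue:
        w = queue.popleft()
        for i in range(len(w) - 1):
            a, b = w[i], w[i + 1]
            # Only distinct letters that are NOT adjacent in G may be interchanged.
            if a != b and frozenset((a, b)) not in dependent:
                swapped = w[:i] + b + a + w[i + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    queue.append(swapped)
    return seen


G = [("a", "b"), ("b", "c"), ("c", "d")]
print("bdadc" in interchange_class("ddbca", G))  # True
```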

There are two special cases of the dependence relation which are standard
in information theory. When G is the complete graph on the vertex set V, i.e., when there
is an edge connecting every pair of vertices, every word over V is congruent only to itself.
At the other extreme, if G is the empty graph on the vertex set, i.e., if no two vertices are
adjacent, then two words are congruent if and only if the number of occurrences of each
symbol in V is the same for both words. The equivalence classes on words are frequently
called type classes or composition classes in the information theory literature and
rearrangement classes or abelian classes in combinatorics. A congruence class of words
for an arbitrary noncommutation graph G is often referred to as a trace because such
classes represent traces of processes, i.e., the sequence of states traversed by the process
from initialization to termination, in nonsequential systems. Because the word trace has
numerous connotations, the term interchange class is used herein to refer to an
equivalence class of words.

Motivated by the success of J. Larus in applying lossless data compression
algorithms to a string of events in a sequential system, R. Alur et al., "Compression of
Partially Ordered Strings," 14th Int'l Conf. on Concurrency Theory (CONCUR 2003),
(Sept. 3, 2003), introduce a compression problem where it is only necessary to reproduce
a string which is in the same interchange class as the original string. R. Alur et al.
describe some compression schemes for the congruence class of a string that in the best
cases can be exponentially more succinct than the optimal grammar-based representation
of the corresponding string. This compression problem also appears in the compression of
executable code. As previously indicated, executable code or program "binaries" are files
whose content must be interpreted by a program or hardware processor which knows
exactly how the data inside the file is formatted in order to utilize it. One of the
techniques given in M. Drinic and D. Kirovski for this compression application is
"instruction rescheduling," in which instructions can be reordered if the decompressed
program is execution-isomorphic to the original.

Interchange Entropy

The present invention considers this compression problem from an
information theoretic perspective. A new generalization of Kolmogorov-Chaitin
complexity referred to as the interchange complexity is proposed and a version of the
subadditive ergodic theorem is used to provide sufficient conditions on probabilistic
sources so that an extension of the asymptotic equipartition property to interchange classes
holds. The average number of bits per symbol needed to represent an interchange class is
referred to as the interchange entropy. The interchange entropy is a functional on a graph
with a probability distribution on its vertex set.

For memoryless sources, there are two earlier graph entropies which have
received considerable attention. The Korner graph entropy, described, for example, in
J. Korner, "Coding of an Information Source Having Ambiguous Alphabet and the Entropy
of Graphs," in Proc. 6th Prague Conf. on Information Theory, 411-25 (1973); or G. Simonyi,
"Graph Entropy: A Survey," in L. Lovasz, P. Seymour, and W. Cook, ed., DIMACS Vol.
20 on Special Year on Combinatorial Optimization, 399-441 (1995), has been found to
have applications in network information theory, characterization of perfect graphs, and
lower bounds on perfect hashing, Boolean formulae size and sorting. Chromatic entropy
was defined in connection with certain parallel-computing models in R. B. Boppana,
"Optimal Separation Between Concurrent-Write Parallel Machines," in Proc. 21st Ann.
ACM Symp. Theory Comp., 320-26 (1989) and demonstrated in N. Alon and A. Orlitsky,
"Source Coding and Graph Entropies," IEEE Trans. Inform. Theory 42, 1329-339 (1996),
to be linked to the expected number of bits required by a transmitter to convey information
to a receiver who has some related data. As discussed below, the interchange entropy has
some properties in common with these other graph entropies. The compression algorithms
of the present invention can asymptotically achieve the interchange entropy for a large
collection of dependence alphabets.

R. Alur et al., referenced above, propose three methodologies for encoding
a string given a partial order on the source alphabet. The first approach is to attempt to
find a string equivalent to the source output string which compresses well. R. Alur et al.
and M. Drinic and D. Kirovski put an alphabetical ordering on the symbols and sort the
letters of the source output string to produce the equivalent string which is minimal under
this ordering. The other algorithms of this variety simultaneously determine the
equivalent string and a grammar-based code for it. These algorithms appear not to be
easily amenable to an information theoretic analysis.

The second class of procedures put forward in R. Alur et al. involves
projections of the string onto subsets of the alphabet. A projection of a string $\sigma$ on
alphabet V onto a subalphabet $A \subseteq V$ is obtained by removing all symbols in
$\sigma$ that are not in A. One of the encoding techniques described by R. Alur et al.
projects the original string onto a set of subalphabets with the property that each symbol
in V will be in at least one of the subalphabets and each pair of adjacent symbols in G will
be in at least one of the subalphabets. Each of these projections is compressed and, as
discussed below, it is possible to use the projections to reconstruct a string equivalent to
$\sigma$. Another scheme for encoding a string given a partial order on the source alphabet
considers the relabeling of symbols in addition to projections and interchanges.

INTERCHANGE COMPLEXITY AND INTERCHANGE ENTROPY 

The asymptotic equipartition property is central to the study of lossless data
compression. It states that most long sequences from a discrete and finite alphabet ergodic
source are typical in the sense that their mean self-information per symbol is close to the
entropy of the source. A consequence of this result is that the average number of bits per
symbol required to losslessly encode the output of an ergodic source is asymptotically
bounded from below by the binary entropy of the source. To find a counterpart for
this lossy compression problem, consider the least amount of information about an
individual string that must be described in order to reproduce another string within the
same interchange class. The appropriate framework for this discussion is algorithmic
information theory. For a finite length string x over the vertex set V, C(x) denotes the
Kolmogorov complexity of x; the basic properties of C(x) are given in M. Li and
P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 2d Ed., §2.1,
107, (Springer, New York, 1997). Let V* be the set of all finite words from V and
|V| denote the cardinality of V. The interchange complexity of a word $x \in V^*$ is
defined with respect to a noncommutation graph G with vertex set V by:

$$C_i(G,x) = \min\{C(y) : y \in V^*,\ y \equiv_G x\}.$$

$C_i(G,x)$ has the interpretation of being the length of the shortest program for a
universal computer that will print out a word y which is congruent to x with respect to the
noncommutation graph G.

The following result is one way to characterize the equivalence of two
strings with respect to a noncommutation graph G:

Theorem 2.1 (D. Perrin, "Words Over a Partially Commutative Alphabet,"
in A. Apostolico and Z. Galil, ed., Combinatorial Algorithms on Words, NATO ASI
Series, Volume F12, 329-40 (Springer, Berlin, 1985)): For any subset A of the vertex set
V and any word w over V, let $\pi_A(w)$ denote the projection of w onto A which is obtained
by deleting from w all symbols which are not in A. The necessary and sufficient
conditions for two words w and x to be congruent are that they are in the same type class
and that $\pi_{\{u,v\}}(w) = \pi_{\{u,v\}}(x)$ for all pairs of symbols $u, v \in V$ which are
adjacent in G.
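
The criterion of Theorem 2.1 translates into a short congruence test. The Python sketch
below is illustrative only; the helper names project and congruent, and the representation
of G as a list of edges, are assumptions introduced for the example.

```python
from collections import Counter


def project(word, subalphabet):
    """Delete from `word` every symbol that is not in `subalphabet`."""
    return "".join(s for s in word if s in subalphabet)


def congruent(w, x, edges):
    """Theorem 2.1: same type class and equal projections onto every edge of G."""
    if Counter(w) != Counter(x):
        return False
    return all(project(w, set(e)) == project(x, set(e)) for e in edges)


G = [("a", "b"), ("b", "c"), ("c", "d")]
print(congruent("ddbca", "bdadc", G))  # True
print(congruent("ddbca", "abddc", G))  # False: a and b appear in the opposite order
```

Unlike an explicit search over interchanges, this test examines each word once per edge
of G.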

Since Theorem 2.1 specifies the necessary and sufficient conditions for two
words to be congruent with respect to a noncommutation graph G, the interchange class
containing a string can be completely determined by any element of the interchange class
which can be used to provide the type class and edge projections. Conversely, given the
type class and edge projections of an interchange class, it is possible to use a knowledge of
these to produce a word in the interchange class for a noncommutation graph G as follows.
If G is the empty graph, then it is straightforward to use the type class to reconstruct a
word consistent with the type. If G is not the empty graph, the type class is used to
determine the number of appearances of any symbol which commutes with every other
symbol in V. The symbols appearing in the edge projections remain. The leftmost symbol
in each projection is initially a possibility for the next symbol in the word. If there are any
two symbols, say u and v, among these which do not commute, then the projection onto
edge {u,v} determines which symbol appears first in the projection, and the other is
removed from the set of possible next symbols. This procedure is iterated until the set of
possible next symbols contains no pair of symbols which are adjacent in G. Any symbol
from this set can be chosen as the next letter. If symbol u is chosen, then the leftmost u is
removed from every edge projection onto u and its neighbors in G. This algorithm is
repeated until every edge projection is empty. It follows that $C_i(G,x)$ can be viewed as
the length, to within O(1), of the shortest program for a universal computer that will
determine the interchange class containing x.
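
The reconstruction procedure described above can be sketched in Python as follows. This
is an illustrative sketch under stated assumptions: the function name reconstruct and the
dictionary encodings of the type class and edge projections are hypothetical, the input is
assumed to be consistent (an inconsistent input raises an IndexError), and ties among
possible next symbols are broken alphabetically although any choice is permitted.

```python
from collections import Counter, deque


def reconstruct(type_class, edge_projections):
    """Build one word of the interchange class.

    type_class:        mapping symbol -> number of occurrences
    edge_projections:  mapping (u, v) -> projection of the word onto {u, v}
    """
    queues = {frozenset(e): deque(p) for e, p in edge_projections.items()}
    in_edge = set().union(*queues)  # symbols having at least one neighbor in G
    out = []
    # Symbols that commute with every other symbol: only their counts matter.
    for sym in sorted(type_class):
        if sym not in in_edge:
            out.extend(sym * type_class[sym])
    while any(queues.values()):
        fronts = {q[0] for q in queues.values() if q}
        # A symbol may come next only if it is leftmost in every edge
        # projection that involves it.
        eligible = [
            u for u in sorted(fronts)
            if all(q and q[0] == u for e, q in queues.items() if u in e)
        ]
        u = eligible[0]
        for e, q in queues.items():
            if u in e:
                q.popleft()  # remove the leftmost u from each projection
        out.append(u)
    return "".join(out)


projections = {("a", "b"): "ba", ("b", "c"): "bc", ("c", "d"): "ddc"}
print(reconstruct(Counter("ddbca"), projections))  # baddc
```

The type class and edge projections in the usage lines are those of the word ddbca for the
graph a-b-c-d, and the word produced is congruent to ddbca.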

Suppose there are words $u, v, w, x \in V^*$ with $u \equiv_G w$ and $v \equiv_G x$.
Then it is easily seen that the words uv and wx formed by respectively appending v to u
and x to w satisfy $uv \equiv_G wx$. Therefore, the interchange complexity is almost
subadditive. In particular, to bound $C_i(G,uv)$ from above, it is observed that one way to
produce a string equivalent to uv is to use a shortest program p to find a string w
equivalent to u, a shortest program q to construct a string x congruent to v, a means to
schedule the two programs to produce w followed by x, and an identification of the
programs p and q. Using this encoding technique it follows that:

$$C_i(G,uv) \le C_i(G,u) + C_i(G,v) + 2\log_2\bigl(\min(C_i(G,u),\, C_i(G,v))\bigr) + O(1). \tag{1}$$

Let l(u) denote the length of word $u \in V^*$. Then $C_i(G,u) \le C(u)$ for all
$u \in V^*$, and $C(u) \le l(u)\log_2|V| + 2\log|V| + c$ for some constant c independent of
u and V. Hence, equation (1) implies that

$$C_i(G,uv) \le C_i(G,u) + C_i(G,v) + 2\log_2\bigl(\min(l(u),\, l(v))\bigr) + O(1). \tag{2}$$

For a word $u_1 \ldots u_n$ with $u_i \in V$, $i \in \{1,\ldots,n\}$, the behavior of
$n^{-1}C_i(G,u_1u_2\ldots u_n)$ is considered for large n. The following result is directly
employed:

Theorem 2.2 (N. G. DeBruijn and P. Erdos, "Some Linear and Some
Quadratic Recursion Formulas I," Indag. Math. 13, 374-82 (1952)): Suppose $\phi$ is a
positive and nondecreasing function that satisfies

$$\int_{1}^{\infty} \frac{\phi(t)}{t^{2}}\,dt < \infty.$$

If $\{x_n\}$ satisfies the relaxed subadditivity relation

$$x_{n+m} \le x_n + x_m + \phi(n+m), \qquad \tfrac{1}{2}n \le m \le 2n,$$

then as $n \to \infty$, $x_n/n$ converges to $\gamma = \inf_{m \ge 1} x_m/m$.

Hence equation (2) and Theorem 2.2 imply that the asymptotic per symbol
information content needed to convey a word equivalent to the original is well-defined.
More specifically:

Proposition 2.3: For any word $u_1 \ldots u_n$ with $u_i \in V$, $i \in \{1,2,\ldots,n\}$, as n
approaches infinity, $n^{-1}C_i(G,u_1 \ldots u_n)$ converges to
$\inf_{m \ge 1} m^{-1}C_i(G,u_1 \ldots u_m)$.

Next, a probabilistic version of Proposition 2.3 is found. The
appropriate frame of reference is subadditive ergodic theory. The following
theorem is utilized:

Theorem 2.4 (Y. Derriennic, "Un Theoreme Ergodique Presque Sous-Additif,"
Ann. Prob. 11, 669-77 (1983)): Let $X_{m,n}$ and $A_{m,n}$, $m \le n$, be two sequences of
random variables with the following properties:

1) $X_{0,n} \le X_{0,m} + X_{m,n} + A_{m,n}$.

2) $X_{m,n}$ is stationary, i.e., the joint distributions of $X_{m,n}$ are the same as the
joint distributions of $X_{m+1,n+1}$, and ergodic.

3) $E[X_{0,1}] < \infty$ and for each n, $E[X_{0,n}] \ge c_0 n$ with $c_0 > -\infty$.

4) $A_{m,n} \ge 0$ and $\lim_{n \to \infty} E[A_{0,n}/n] = 0$.

Then

$$\lim_{n \to \infty} \frac{X_{0,n}}{n} = \lim_{n \to \infty} \frac{E[X_{0,n}]}{n} = \inf_{n \ge 1} \frac{E[X_{0,n}]}{n} \quad \text{almost surely.}$$



Theorem 2.4 is applied to the output of two very broad categories of
sources. A discrete source is said to be stationary if its probabilistic specification is
independent of a time origin and ergodic if it cannot be separated into two or more
different persisting modes of behavior. A more precise definition of a discrete, stationary,
and ergodic source can be found in R. G. Gallager, Information Theory and Reliable
Communication, §3.5 (Wiley, New York, 1968). A unifilar Markov source with finite
alphabet V and finite set of states S is defined by specifying for each state $s \in S$ and
letter $v \in V$:

1) the probability $p_{s,v}$ that the source emits v from state s;

2) the unique next state $S[s,v]$ after v is output from state s.

Given any initial state $s_0 \in S$, these rules inductively specify both the probability
$P(\sigma|s_0)$ that any given source string $\sigma \in V^*$ is emitted and the resulting state
$S[s_0,\sigma]$ after $\sigma$ is output. For the null string $\emptyset$ and each state
$s \in S$, the convention is that $P(\emptyset|s) = 1$. It is assumed that the source has a
single recurrent class of states; i.e., for each pair of states s and r, there is a non-null
string $\sigma \in V^*$ such that $P(\sigma|s) > 0$ and $S[s,\sigma] = r$. The class of
unifilar Markov sources is fairly general and includes, for each $l \ge 1$, the group of
sources for which each output depends statistically only on the l previous output symbols.
The following result is obtained:

Theorem 2.5 (A.E.P. for interchange classes): Let $U_1, U_2, \ldots$ be the random
output of a finite alphabet, discrete, stationary, and ergodic source or of a finite state and
finite alphabet unifilar Markov source. Then

$$\lim_{n \to \infty} \frac{C_i(G,U_1U_2 \ldots U_n)}{n} = \lim_{n \to \infty} \frac{E[C_i(G,U_1U_2 \ldots U_n)]}{n} = \inf_{m \ge 1} \frac{E[C_i(G,U_1U_2 \ldots U_m)]}{m} \quad \text{almost surely.}$$

Unless otherwise specified, it is assumed hereafter that we have
probabilistic sources P in which the random variables $n^{-1}C_i(G,U_1 \ldots U_n)$ converge
almost surely or in probability to $\lim_{n \to \infty} n^{-1}E[C_i(G,U_1 \ldots U_n)]$. The
latter expression is referred to as the interchange entropy and is denoted by $H_i(G,P)$.
Just as the asymptotic equipartition property for strings leads to a notion of typical
sequences which all have about the same probability and together constitute the possible
outputs of the source with high probability, Theorem 2.5 provides a comparable concept of
typical interchange classes. Most long strings require close to $H_i(G,P)$ bits per symbol to
describe an equivalent string with respect to the noncommutation graph G. It follows that
the typical sequences of length n fall into approximately $2^{nH_i(G,P)}$ typical interchange
classes.

It is generally considered to be difficult to determine or even bound the
limiting constants obtained by a subadditivity argument. For the present problem, there
are two straightforward approaches to bounding $H_i(G,P)$ from above. The first of these
is to simply reproduce the exact source output string. For a discrete, ergodic source with
finite alphabet and entropy H(P), it is known that $n^{-1}C(U_1 \ldots U_n)$ converges to H(P)
with probability 1. This procedure is optimal when G is the complete graph on V. Another
approach is to count the number of interchange classes for a particular string length and
allocate a fixed-length codeword to each interchange class. More precisely, one can assign
an alphabetic ordering to the elements of the vertex set and, following T. M. Cover and
J. A. Thomas, Elements of Information Theory, 152 (Wiley, New York, 1991), use the
program "Generate, in lexicographic order, all alphabetically minimal elements of the
interchange classes of length n. Of these words, print the i-th word."

The moment generating function for the number of interchange classes for
words of a given length was shown to be equal to the inverse of the Mobius polynomial
corresponding to a function of G. Recently, a formula for the dominant term in the
asymptotic expansion of the number of traces was provided in M. Goldwurm and M.
Santini, "Clique Polynomials Have a Unique Root of Smallest Modulus," Information
Processing Letters 75(3), 127-132, (2000). In the special case where G is the empty
graph, it is well known that the number of type classes of length n for a vertex set V with
cardinality |V| is at most $(n+1)^{|V|}$. Hence, if G is the empty graph, then
$H_i(G,P) = 0$ for all probability distributions P. The Korner graph entropy and chromatic
entropy are also known to be H(P) when G is the complete graph on the vertex set and 0
when G is the empty graph on the vertex set.
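
The empty-graph case can be checked numerically. The Python sketch below is offered
only as an illustration, and its function names are hypothetical. It counts type classes
exactly with a binomial coefficient and shows that the per-symbol cost of a fixed-length
index into the type classes vanishes as n grows, consistent with an interchange entropy of
0.

```python
from math import comb, log2


def num_type_classes(n, alphabet_size):
    """Number of ways to split n occurrences among alphabet_size symbols."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)


def index_bits_per_symbol(n, alphabet_size):
    """Bits per symbol of a fixed-length codeword naming one type class."""
    return log2(num_type_classes(n, alphabet_size)) / n


for n in (10, 1000, 100000):
    count = num_type_classes(n, 4)
    assert count <= (n + 1) ** 4  # the bound stated in the text
    print(n, index_bits_per_symbol(n, 4))
```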

The characterization of interchange classes by type class and edge
projections provided in Theorem 2.1 implies that the interchange entropy is monotonic,
subadditive, and for memoryless sources satisfies two special cases of additivity under
vertex substitution. Let E denote the edge set of a graph.

Proposition 2.6 (Monotonicity): If F and G are two graphs on the same
vertex set and the respective edge sets satisfy $E(F) \subseteq E(G)$, then for any word x we
have $C_i(F,x) \le C_i(G,x)$. Hence, for any probability distribution P we have
$H_i(F,P) \le H_i(G,P)$. The Korner graph entropy and chromatic entropy are also known to
be monotonic.

Proposition 2.7 (Subadditivity): Let F and G be two graphs on the same
vertex set V and define $F \cup G$ to be the graph on V with edge set $E(F) \cup E(G)$. For
any word x, $C_i(F \cup G,x) \le C_i(F,x) + C_i(G,x) + O(1)$. Therefore, for any fixed
probability distribution P, $H_i(F \cup G,P) \le H_i(F,P) + H_i(G,P)$. The Korner graph
entropy is also subadditive.

The concept of substitution of a graph F for a vertex v in a disjoint graph G
is described in G. Simonyi, §3. The idea is that v and the edges in G with v as an endpoint
are removed and every vertex of F is connected to those vertices of G that were adjacent to
v. This notion can be extended to a property of Korner graph entropy known as
"additivity of substitution." This property does not hold in general for the interchange
entropy, but there are two special cases which apply. The first one is concerned with
graphs consisting of more than one connected component.

Proposition 2.8: Let the subgraphs $G_j$ denote disjoint components of the
graph G; i.e., there is no edge in E(G) with one endpoint in $V(G_j)$ and the other in
$V(G_l)$ for $j \ne l$. For a memoryless source with probability distribution P on V(G)
define the probability distributions

$$P_j(x) = P(x)\,[P(V(G_j))]^{-1}, \quad x \in V(G_j). \tag{3}$$

Then $H_i(G,P) = \sum_j P(V(G_j))\,H_i(G_j,P_j)$.
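
As a worked illustration of Proposition 2.8, the following Python sketch forms the
conditional distributions of equation (3) and the weighted sum of component entropies.
The function names are hypothetical, and the interchange entropy of each component must
be supplied by the caller. In the usage lines every component is a complete graph (an
edge or a single vertex), so its interchange entropy is taken to be the Shannon entropy of
the corresponding distribution, as stated above for complete graphs.

```python
from math import log2


def shannon_entropy(dist):
    """Entropy in bits of a distribution given as symbol -> probability."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)


def component_distribution(P, component):
    """Return P(V(G_j)) and the conditional distribution P_j of equation (3)."""
    mass = sum(P[x] for x in component)
    return mass, {x: P[x] / mass for x in component}


def entropy_by_components(P, components, component_entropy):
    """Weighted sum over components, as in Proposition 2.8 (memoryless sources)."""
    total = 0.0
    for comp in components:
        mass, Pj = component_distribution(P, comp)
        total += mass * component_entropy(comp, Pj)
    return total


# Graph a-b c with a uniform source on {a, b, c}: components {a, b} and {c}.
P = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
value = entropy_by_components(P, [("a", "b"), ("c",)],
                              lambda comp, Pj: shannon_entropy(Pj))
print(value)  # about 0.6667 bits per symbol
```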

An example illustrates that Proposition 2.8 fails in general to hold for the
output of sources with memory. Suppose that V = {a, b, c, d}, G consists of the two
components a-b and c-d, and the source is an order-1 Markov chain with
$P(c|a) = P(d|b) = 1$ and $P(a|c) = P(b|c) = P(a|d) = P(b|d) = 0.5$. Assume that the first
symbol is equally likely to be an a or a b. In other words, the source outputs two symbols
at a time independently with half being ac and the other half being bd. It is easy to verify
that the entropy of the original source is 0.5 bits per symbol. Next suppose F is the graph
on V whose only edge is a-b. In order to represent a word congruent to the source output
with respect to F, the projection of the string onto the subalphabet {a, b} must be precisely
characterized. Note that this projection looks like the output of a binary, memoryless
source with P(a) = P(b) = 0.5. Since half of the symbols from the original string appear in
the projection, it follows that $H_i(F,P) = 0.5$ bits per symbol. Proposition 2.6 gives
$H_i(G,P) \ge H_i(F,P)$, and since $H_i(G,P)$ cannot exceed the source entropy of 0.5 bits
per symbol, $H_i(G,P) = 0.5$ bits per symbol. Let $G_1$ = a-b and $G_2$ = c-d. Observe that

$$H_i(G,P) \ne P(V(G_1))\,H_i(G_1,P_1) + P(V(G_2))\,H_i(G_2,P_2) = 1 \text{ bit per symbol.}$$

The reason that Proposition 2.8 is invalid in this case is that the projection of source
output symbols onto $G_2$ is perfectly correlated with the projection of source output
symbols onto $G_1$ in that it can be obtained by replacing each a with a c and each b with
a d.
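
The figures in this example can be reproduced with a short Python sketch, given here for
illustration only. The function names are hypothetical, and the calculation assumes the
chain is evaluated at its stationary distribution, which is obtained by a lazy power
iteration because the chain is periodic.

```python
from math import log2

# Transition probabilities of the order-1 Markov chain in the example.
T = {
    "a": {"c": 1.0},
    "b": {"d": 1.0},
    "c": {"a": 0.5, "b": 0.5},
    "d": {"a": 0.5, "b": 0.5},
}


def stationary(T, iters=200):
    """Stationary distribution via lazy power iteration (handles periodicity)."""
    pi = {s: 1.0 / len(T) for s in T}
    for _ in range(iters):
        nxt = {s: 0.5 * pi[s] for s in T}
        for s, row in T.items():
            for t, p in row.items():
                nxt[t] += 0.5 * pi[s] * p
        pi = nxt
    return pi


def row_entropy(row):
    return sum(-p * log2(p) for p in row.values() if p > 0)


def entropy_rate(T):
    """Entropy rate in bits per symbol of the stationary chain."""
    pi = stationary(T)
    return sum(pi[s] * row_entropy(T[s]) for s in T)


pi = stationary(T)
# Right-hand side of Proposition 2.8: each component sees a uniform binary
# marginal, so each component entropy is 1 bit.
rhs = (pi["a"] + pi["b"]) * 1.0 + (pi["c"] + pi["d"]) * 1.0
print(entropy_rate(T), rhs)  # 0.5 versus 1.0
```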

A second example of additivity of substitution for the interchange entropy
is considered assuming the original source string is the output of a memoryless source.

Proposition 2.9: Let F be a graph consisting of two vertices x and y and an
edge connecting them, let G be a graph with vertex set disjoint from F, and let v be a
vertex of G. Form the graph $G_{v \leftarrow F}$ by deleting v and joining both vertices of F to
those vertices of G which were adjacent to v. For a memoryless source with probability
distribution $P_{v \leftarrow F}$ on $V(G_{v \leftarrow F})$, we define two auxiliary memoryless
sources, one over V(G) with probability distribution P and the other over V(F) with
probability distribution Q as follows:

$$P(u) = \begin{cases} P_{v \leftarrow F}(u), & u \in V(G),\ u \ne v \\ P_{v \leftarrow F}(x) + P_{v \leftarrow F}(y), & u = v \end{cases} \tag{4}$$

$$Q(x) = \frac{P_{v \leftarrow F}(x)}{P(v)}, \qquad Q(y) = \frac{P_{v \leftarrow F}(y)}{P(v)}.$$

Then $H_i(G_{v \leftarrow F},P_{v \leftarrow F}) = H_i(G,P) + P(v)H(Q)$.

For discrete memoryless sources, the exact expression is obtained for
H_i(G, P) in the case where G is a complete k-partite graph K_{m_1, m_2, ..., m_k}.

Theorem 2.10: Assume a discrete, memoryless source with probability
distribution P on vertex set V. Suppose V is of the form V = V_1 ∪ V_2 ∪ ... ∪ V_k with
|V_i| = m_i, i ∈ {1, 2, ..., k}, and label the elements of V_i as v_{i,j}, i ∈ {1, 2, ..., k}, j ∈ {1, 2, ..., m_i}.
For our complete k-partite graph K_{m_1, m_2, ..., m_k} there is an edge corresponding to every pair of
vertices {v_{i,j}, v_{l,n}}, v_{i,j} ∈ V_i, v_{l,n} ∈ V_l, l ≠ i, and no two vertices from the same subset V_i are
adjacent for any i ∈ {1, 2, ..., k}. Define Q_i = Σ_{j=1}^{m_i} P(v_{i,j}), i ∈ {1, 2, ..., k}. Then

$$H_i(K_{m_1, m_2, \ldots, m_k}, P) = H(P) - \sum_{s=2}^{\infty} \log_2(s) \sum_{i:\, m_i \ge 2} (1 - Q_i) \left[ Q_i^{\,s} - \sum_{j=1}^{m_i} \left( \frac{P(v_{i,j})}{1 - Q_i + P(v_{i,j})} \right)^{s} \right].$$
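The expression in Theorem 2.10 can be evaluated numerically by truncating the infinite sum over s. The following Python sketch is a rough illustration rather than part of the specification: the function name `interchange_entropy_k_partite` and the cutoff `s_max` are assumptions, and each part of the k-partite graph is passed as a list of symbol probabilities.

```python
import math


def interchange_entropy_k_partite(parts, s_max=2000):
    """Approximate H_i for a complete k-partite graph and a memoryless source.

    parts: list of lists; parts[i] holds the probabilities P(v_{i,j}) of the
           symbols in vertex subset V_i (hypothetical input layout).
    s_max: assumed truncation point for the infinite sum over s.
    """
    probs = [p for part in parts for p in part]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)

    correction = 0.0
    for part in parts:
        if len(part) < 2:
            continue  # only subsets with m_i >= 2 contribute
        q = sum(part)  # Q_i
        for s in range(2, s_max + 1):
            inner = q ** s - sum((p / (1 - q + p)) ** s for p in part)
            correction += math.log2(s) * (1 - q) * inner
    return entropy - correction


# Path graph on a, b, c with edges ab and bc, uniform probabilities:
# complete bipartite with parts {b} and {a, c}.
value = interchange_entropy_k_partite([[1 / 3], [1 / 3, 1 / 3]])
```

For this three-vertex path the result should land close to the figure of about 1.27645 bits per symbol that the example later in this section reports. When every part is a singleton the graph is complete, no correction term is applied, and the function returns H(P).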



Theorem 2.10 leads to the following property of the interchange entropy for 
the output from a discrete, memoryless source. 

Corollary 2.11: Assume a discrete, memoryless source with probability
distribution P on vertex set V(G). If G is not the complete graph on V(G), then
H_i(G, P) < H(P).

The example following Proposition 2.8 illustrates that it is possible for a
source with memory to satisfy H_i(G, P) = H(P) even when the dependence relation G is
not the complete graph on V(G).

An example illustrates some of the results in this section. Suppose the
noncommutation graph G is a — b — c and P(a) = P(b) = P(c) = 1/3. A simple upper bound
for H_i(G, P) is H_i(G, P) ≤ H(P) = log_2 3 ≈ 1.58496. Define the graphs F_1 = a — b c, F_2 =
a b — c, F_3 = c — a b. By symmetry, H_i(F_1, P) = H_i(F_2, P) = H_i(F_3, P). It follows from
Proposition 2.8 that H_i(F_1, P) = (2/3) · 1 + (1/3) · 0 = 2/3. Since G = F_1 ∪ F_2, Proposition 2.7
implies that H_i(G, P) ≤ H_i(F_1, P) + H_i(F_2, P) = 4/3 ≈ 1.33333. Therefore, for this source
and dependence relation, a compression scheme consisting of encoding the two edge
projections and having the decoder use the edge projections to reconstruct a word in the
equivalence class would require fewer bits per symbol on average than losslessly
compressing the entire string. Since F_1 ⊂ G and F_2 ⊂ G, Proposition 2.6 provides that
H_i(G, P) ≥ H_i(F_1, P) ≈ 0.66667. Another lower bound on H_i(G, P) follows from the fact
that the complete graph on the vertex set is G ∪ F_3. Therefore, by subadditivity,
H(P) ≤ H_i(G, P) + H_i(F_3, P) and so H_i(G, P) ≥ log_2 3 − 2/3 ≈ 0.91830. Since G is a
complete bipartite graph, Theorem 2.10 implies that

$$H_i(G, P) = \log_2 3 - \frac{1}{3} \sum_{s=2}^{\infty} \left[ \left(\tfrac{2}{3}\right)^{s} - 2 \left(\tfrac{1}{2}\right)^{s} \right] \log_2(s) \approx 1.27645.$$

The following section considers some universal compression algorithms for
the problem of representing interchange classes and begins with a discussion of normal
forms.

NORMAL FORMS AND VARIATIONS

There are two types of normal forms which are frequently discussed in the
trace theory literature. One of these is known as the lexicographic normal form and was
first considered in A. V. Anisimov and D. E. Knuth, "Inhomogeneous Sorting," Int. J.
Comp. Inform. Sci. 8, 255-260 (1979). The other normal form is called the Foata normal
form, described in P. Cartier and D. Foata, "Problemes Combinatoires de Commutation et
Rearrangements," Lecture Notes in Mathematics 85 (Springer, Berlin, 1969).

In order to compute either normal form, a total ordering on the vertex set V
must be given. The lexicographic normal form of an interchange class is the unique word
in the interchange class which is minimal with respect to the lexicographic ordering.
Continuing the example considered in the introduction, assume a noncommutation graph
G is given by a — b — c — d and suppose that a < b < c < d. The lexicographic normal form
of the interchange class containing the two words ddbca and bdadc is baddc. It has been
shown that a necessary and sufficient condition for a word w to be the lexicographic
normal form of an interchange class is that for all factorizations w = xvyuz such that u and
v are commuting symbols in V with u < v, x and z are possibly empty words over V, and y
is a non-empty word over V, there exists a letter of y which does not commute with u.

In order to define the Foata normal form, the notion of finite non-empty
subsets of pairwise independent letters is needed. Define the set ℱ by ℱ = {F ⊆ V | F is
non-empty, F contains at most one appearance of any symbol v ∈ V, and every pair of
symbols u, v ∈ F with u ≠ v commute}.

Each F ∈ ℱ is called an elementary step and it can be converted into a
type class denoted by [F] consisting of words which are products of all of the elements of
F.

The Foata normal form of an interchange class c is the unique string of
elementary steps φ_1 φ_2 ... φ_r with r ≥ 0 and φ_i ∈ ℱ, i ∈ {1, 2, ..., r}, with the properties

• c = [φ_1][φ_2]...[φ_r]

• for each 1 ≤ i < r and each letter u ∈ φ_{i+1} there exists a letter v ∈ φ_i either
satisfying v = u or u and v are adjacent in the noncommutation graph G.

The number of elementary steps r in the Foata normal form is a measure of
the parallel execution time associated with an interchange class. P. Cartier and D. Foata
were the first to establish that the Foata normal form is well-defined, and there are many
proofs of this result. To return to the previous example, when the noncommutation graph
G is given by a — b — c — d, it follows that the set of elementary steps is
ℱ = {{a},{b},{c},{d},{a,c},{a,d},{b,d}} and
the Foata normal form for the interchange class containing the words ddbca and bdadc is
{b,d},{a,d},{c}.

An algorithm (as well as exemplary pseudocode) to compute both the
lexicographic normal form and the Foata normal form of an interchange class from one of
its members was provided in D. Perrin, "Words Over a Partially Commutative Alphabet,"
in A. Apostolico and Z. Galil, ed., Combinatorial Algorithms on Words, NATO ASI
Series, Volume F12, 329-340, (Springer, Berlin, 1985), incorporated by reference herein.
FIG. 3 is a flow chart describing an exemplary implementation of the normal form
generation process 300 that may be employed by the compression process of FIG. 2. The
procedure employs a stack corresponding to each vertex v ∈ V. Let w be a word over the
alphabet V. The symbols of w are processed during step 310 from right to left. Upon
seeing a letter u, a u is pushed on its stack and a marker is pushed on the stacks
corresponding to symbols which are adjacent to u in the noncommutation graph G during
step 320. A test is performed during step 330 to determine if the entire word has been
processed. When the entire word has been processed, the stacks can be used during step
340 to determine either the lexicographic normal form or the Foata normal form for the
interchange class containing the word.

• To obtain the lexicographic normal form: At each step the next
letter of the normal form is the minimum letter u with respect to the
lexicographic ordering which is currently at the top of some stack. The
letter u is popped from its stack and a marker is popped from each stack
corresponding to a vertex v ∈ V which is adjacent to u in G. This procedure
is iterated until every stack is empty.

• To derive the Foata normal form: At each step the members of the
next elementary step are those letters which are on the tops of stacks.
These letters are popped from their stacks and, for each member u of the
elementary step, a marker is also popped from each stack corresponding to a
letter v ∈ V which does not commute with u. This procedure is iterated until
every stack is empty.

Resuming the preceding example, when the dependence relation G is a — 
b — c — d and the original word is ddbca, the resulting stacks are shown in FIG. 4. It is 
straightforward to verify that the procedures specified above lead to baddc as the 
lexicographic normal form and {b,d},{a,d},{c} as the Foata normal form. 
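The stack procedure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the pseudocode of the cited reference: the function names, the use of `None` as the marker, the dictionary `adjacency` that lists the neighbors of each vertex in G, and the reliance on Python's built-in string ordering for the total order a < b < c < d are all assumptions made for the sketch.

```python
MARKER = None  # assumed placeholder pushed on the stacks of adjacent symbols


def build_stacks(word, adjacency):
    """Scan word from right to left, pushing letters and markers."""
    stacks = {v: [] for v in adjacency}
    for u in reversed(word):
        stacks[u].append(u)
        for v in adjacency[u]:
            stacks[v].append(MARKER)
    return stacks


def lexicographic_normal_form(word, adjacency):
    """Repeatedly emit the smallest letter that sits on top of a stack."""
    stacks = build_stacks(word, adjacency)
    out = []
    for _ in range(len(word)):
        u = min(v for v, s in stacks.items() if s and s[-1] is not MARKER)
        stacks[u].pop()
        for v in adjacency[u]:
            stacks[v].pop()  # remove the marker that u had placed
        out.append(u)
    return "".join(out)


def foata_normal_form(word, adjacency):
    """Repeatedly emit, as one elementary step, all letters on stack tops."""
    stacks = build_stacks(word, adjacency)
    steps = []
    remaining = len(word)
    while remaining:
        step = [v for v, s in stacks.items() if s and s[-1] is not MARKER]
        for u in step:
            stacks[u].pop()
        for u in step:
            for v in adjacency[u]:
                stacks[v].pop()  # one marker per member of the step
        steps.append(set(step))
        remaining -= len(step)
    return steps


# Noncommutation graph: the path on a, b, c, d with edges ab, bc, cd.
PATH = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

lex = lexicographic_normal_form("ddbca", PATH)
foata = foata_normal_form("ddbca", PATH)
```

On the running example both ddbca and bdadc reduce to baddc and to the steps {b,d},{a,d},{c}, matching the values stated in the text.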

Given these notions of normal forms, there are three categories of
techniques that will be considered for transforming a source output string before a
universal grammar-based lossless data compression scheme is applied. The first of these
selects a total ordering on the vertex set V and finds the lexicographic normal form of the
interchange class containing the source output string. Observe that for every pair of
symbols u and v with u < v which commute in G, the lexicographic normal form derived
from a word never contains the substring vu.

The other two categories of processing the source output string are based
upon the Foata normal form. Let F_1, F_2, ..., F_l be all of the elementary steps that constitute
the set ℱ of elementary steps; i.e., ℱ = ∪_{i=1}^{l} {F_i}. For each F_i, one word w_i is selected
in the type class [F_i]. Then, for one category of source output string transformations, the
source output string is mapped into the concatenation of words obtained by replacing each
elementary step in its Foata normal form with the corresponding representative word.
Persisting with the foregoing example, if the dependence graph G is given by
a — b — c — d and the words ca, ad, and db are selected to respectively represent the
elementary steps {a,c}, {a,d}, and {b,d}, then the strings in the interchange class
containing the words ddbca and bdadc are all mapped into dbadc.

For the last transformation, a superalphabet V_s of the vertex set V is
defined corresponding to the l elements of the set ℱ of elementary steps. For example, let
v_ac, v_ad and v_bd be new letters respectively corresponding to the elementary steps {a,c},
{a,d}, and {b,d}. Each string in an interchange class is represented by the concatenation of
superletters obtained by substituting each elementary step in its Foata normal form with
the analogous superletter. Continuing the last example, the words ddbca and bdadc would
be transformed into v_bd v_ad c. The outcome of this last transformation is to map a word
into a possibly shorter one over a larger alphabet. Observe that this representation of an
interchange class highlights the parallelism leading to a minimal execution time.
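The two transformations based on the Foata normal form can be sketched as follows. The Python below is a minimal illustration that takes the elementary steps as given; the function names, the `v_` prefix used to spell superletters, and the fallback of sorted letters when no representative word has been chosen are assumptions made for the sketch and are not prescribed by the text.

```python
def to_representative_word(steps, representatives):
    """Replace each elementary step by its chosen representative word.

    representatives maps a frozenset of letters to the selected word w_i.
    Assumption: steps without an entry fall back to their sorted letters.
    """
    words = []
    for step in steps:
        key = frozenset(step)
        words.append(representatives.get(key, "".join(sorted(key))))
    return "".join(words)


def to_superletters(steps):
    """Replace each multi-letter elementary step by one superletter."""
    out = []
    for step in steps:
        letters = "".join(sorted(step))
        # Single letters are kept as they are; larger steps get a new symbol.
        out.append(letters if len(letters) == 1 else "v_" + letters)
    return out


# Foata normal form of the running example and the representatives it names.
foata_steps = [{"b", "d"}, {"a", "d"}, {"c"}]
chosen = {
    frozenset("ac"): "ca",
    frozenset("ad"): "ad",
    frozenset("bd"): "db",
}

as_word = to_representative_word(foata_steps, chosen)
as_superletters = to_superletters(foata_steps)
```

With the representatives named in the example, the first function yields dbadc and the second yields the three-symbol string v_bd v_ad c, so the superletter form is shorter but drawn from a larger alphabet.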

The transformations defined above can be used for any noncommutation
graph G. It is mentioned in passing that when G is not connected, the option is available of
finding its components, projecting the original string onto each subalphabet consisting of
the vertices of a component, and proceeding to use any of the three categories of normal
form representations listed above for mapping the projections of the original string.

COMBINING NORMAL FORMS AND IRREDUCIBLE GRAMMAR-BASED CODES

The normal form can be viewed as the string which is the output of an auxiliary
source. In general, the auxiliary source is not ergodic. For example, suppose you have a
binary source which is not necessarily ergodic emitting the digits 0 and 1 and the digits
commute. As discussed above, the interchange entropy of this source is zero. If the
lexicographic order is selected 0 < 1 and the binary string contains l zeroes and m ones,
then its lexicographic normal form is a run of l zeroes followed by a run of m ones, its
Foata normal form is min{l,m} copies of the string 01 concatenated with l−m zeroes if l > m
or m−l ones if m > l, and the final transformation is min{l,m} copies of the auxiliary
symbol v_01 followed by a run of l−m zeroes if l > m or m−l ones if m > l. The first and third
of these normal forms are piecewise stationary and the second one is piecewise ergodic. In
each case, it can be shown that many compression schemes including LZ '78 and Sequitur
will asymptotically approach zero bits per symbol on the output of the auxiliary source as
the original string length approaches infinity.
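The three representations of a binary string whose two digits commute depend only on the counts l and m. A small Python sketch, in which the function name `binary_normal_forms` and the spelling `v_01` for the auxiliary symbol are assumptions, makes this concrete:

```python
def binary_normal_forms(bits):
    """Return the three transformed forms of a string over {0, 1}.

    Assumes the digits 0 and 1 commute and that 0 < 1 in the total order.
    """
    l = bits.count("0")
    m = bits.count("1")
    pairs = min(l, m)
    # Leftover run: zeroes if l > m, ones if m > l, empty if l == m.
    tail = "0" * (l - m) if l > m else "1" * (m - l)

    lexicographic = "0" * l + "1" * m
    foata_word = "01" * pairs + tail
    superletters = ["v_01"] * pairs + list(tail)
    return lexicographic, foata_word, superletters


lex_form, foata_form, super_form = binary_normal_forms("10110")
```

Each output consists of at most two long runs of a repeated pattern, which is why dictionary and grammar-based coders can describe it with a vanishing number of bits per symbol.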

To illustrate another difficulty, the example following Proposition 2.8 is
again considered. Suppose once more that V = {a,b,c,d}, G = a — b c — d, and the source is an
order-1 Markov chain with P(c|a) = P(d|b) = 1, P(a|c) = P(b|c) = P(a|d) = P(b|d) = 0.5. As discussed
above, H_i(G, P) = H(P) = 0.5 bits per symbol. Next assume that the total ordering of the
vertex set is a < b < c < d and begin to process a source output string by converting it into its
lexicographic normal form. Then, for a string of length 2N, the first N symbols look like
the output of a binary, memoryless source with P(a) = P(b) = 0.5. The remaining N symbols
can be found from the first N by replacing each a with a c and each b with a d. It is in
some respects accurate to state that the information rate of this auxiliary source is 0.5 bits
per symbol. However, if a grammar-based code or any other practical lossless universal
data compression algorithm is naively applied to the output of the auxiliary source, then
the minimum average compression rate achievable will be 1 bit per symbol. The
transformations based upon the Foata normal form are better suited for this particular
compression problem.

Two instances are demonstrated below for which the auxiliary source is
Markov with a countably infinite state space and which has the property that the auxiliary
source entropy is equal to the original source's interchange entropy. Since a universal
grammar-based code compresses an ergodic source to its entropy, the combined codes of
the present invention compress a source to the interchange entropy in these special cases.

First consider dependence relations which are complete k-partite graphs.
As in the section entitled "Interchange Complexity and Interchange Entropy," the vertex
set V is represented by V = V_1 ∪ V_2 ∪ ... ∪ V_k, |V_i| = m_i, i ∈ {1, 2, ..., k}, the elements of V_i are
labeled as v_{i,j}, i ∈ {1, 2, ..., k}, j ∈ {1, 2, ..., m_i}, and it is assumed that every pair of vertices
{v_{i,j}, v_{l,n}}, v_{i,j} ∈ V_i, v_{l,n} ∈ V_l, l ≠ i, is an edge in the noncommutation graph G and that there
is no edge consisting of two vertices from the same subset of vertices V_i for any
i ∈ {1, 2, ..., k}. Again, consider the partitioning of the data string into a sequence of
variable-length phrases corresponding to maximal runs of symbols from a vertex subset
V_i.

An auxiliary source is specified which captures both the mapping into
lexicographic normal form and the first transformation into Foata normal form. It is
assumed that each phrase from the original source is converted to a string which is the
unique designated representative for the type class for that phrase. The auxiliary source is
then a countably infinite Markov chain where the state at any time consists of the suffix of
the designated representative phrase beginning with the current symbol. While within a
phrase the auxiliary source has no uncertainty in the transition from one state to the next;
i.e., there is a single possible transition that occurs with probability 1. All of the
uncertainty resides in the transition from the final letter in a phrase to the first state
corresponding to the next phrase, and these transition probabilities depend only on the
vertex subset associated with the current phrase. Let H_a(K_{m_1, m_2, ..., m_k}, P) denote the entropy
of the auxiliary source.

Theorem 4.1: Assume a discrete memoryless source with probability
distribution P on the vertex set of a complete k-partite graph K_{m_1, m_2, ..., m_k}. Segment the
output of the source into a sequence of variable-length phrases corresponding to maximal
runs of symbols from a vertex subset. Replace each phrase by a string from the same type
class which is the sole assigned representative for that particular type class. The entropy
H_a(K_{m_1, m_2, ..., m_k}, P) of this modified source satisfies H_a(K_{m_1, m_2, ..., m_k}, P) = H_i(K_{m_1, m_2, ..., m_k}, P).
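The segmentation used in Theorem 4.1 can be illustrated with a short Python sketch. The function name `canonical_phrases`, the dictionary `part_of` that maps each symbol to the index of its vertex subset, and the choice of the sorted string as the designated representative of a type class are all assumptions made for the illustration.

```python
from itertools import groupby


def canonical_phrases(word, part_of):
    """Split word into maximal runs from one vertex subset and canonicalize.

    Symbols in the same subset commute in a complete k-partite graph, so any
    fixed ordering of a run can serve as the representative of its type
    class. Here the sorted run is used (an assumed convention).
    """
    phrases = []
    for _, run in groupby(word, key=lambda ch: part_of[ch]):
        phrases.append("".join(sorted(run)))
    return phrases


# Path on a, b, c with edges ab and bc: parts {a, c} and {b}.
part_of = {"a": 0, "c": 0, "b": 1}
phrases = canonical_phrases("cabca", part_of)
transformed = "".join(phrases)
```

With sorted runs as representatives, the concatenated output coincides with the lexicographic normal form under the usual alphabetical order, since letters from different subsets never commute and so never move past one another.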

Consider the third transformation of the original source into an auxiliary
source. In this case, the superalphabet V_s consists of the union over all i ∈ {1, ..., k} of all
non-empty subsets of the vertex subset V_i. The source string of length n over the vertex
set V is mapped into a generally shorter string over the superalphabet V_s. The definition
of a phrase remains identical, and the requirement is maintained that each phrase from the
original source be converted into a string over the superalphabet which is the unique
designated representative for the type class for that phrase and from which the original
string can be recovered.

Theorem 4.2: The entropy H_a(K_{m_1, m_2, ..., m_k}, P) of an auxiliary source
corresponding to the sequence of superletters given by the Foata normal form on the
output of a discrete memoryless source with probability distribution P on the vertex set of
a complete k-partite graph K_{m_1, m_2, ..., m_k} satisfies H_a(K_{m_1, m_2, ..., m_k}, P) = H_i(K_{m_1, m_2, ..., m_k}, P).

Consider the case where the noncommutation graph contains at least one
vertex which is adjacent to all others. Let V_a ⊂ V be the set of symbols which do not
commute with any others. In this case, the source output string can be uniquely partitioned
into a sequence of variable-length phrases consisting of zero or more symbols not in V_a
followed by a symbol in V_a. Since no symbol in V_a commutes with any other, the
projections onto the subalphabets associated with the various edges can be computed by
the sequence of interchange classes corresponding to the phrases. Conversely, these
projections can be used to determine this sequence of interchange classes. It follows from
Theorem 2.1 that the minimum information required to perfectly reconstruct the
interchange class containing a source output string is the sequence of interchange classes
corresponding to the variable-length phrases. For a memoryless source, this sequence of
interchange classes is an independent and identically distributed process. Thus, H_i(G, P)
can in principle be found using renewal theory. Let U_1, U_2, ..., U_n denote a random string
of length n. There are epochs 1 = M_1 < M_2 < ... for which the symbols
U_{M_l}, U_{M_l + 1}, ..., U_{M_{l+1} − 1} make up the lth phrase, l ≥ 1. Let T_l = M_{l+1} − M_l, l ≥ 1, denote the
number of symbols in the lth phrase and R_l represent the self-information of the
interchange class associated with phrase l. If σ_l is the original source string
corresponding to phrase l, then R_l = −log_2 [P(σ_l) |{θ : θ ≡_G σ_l}|]. Let
L(n) = sup{l : M_{l+1} ≤ n + 1} denote the number of complete phrases by symbol n. The
average self-information per symbol is bounded from below by

$$\sum_{l=1}^{L(n)} R_l \Big/ \sum_{l=1}^{L(n)+1} T_l$$

and from above by

$$\sum_{l=1}^{L(n)+1} R_l \Big/ \sum_{l=1}^{L(n)} T_l.$$

In the limit as n → ∞, the following equation holds:

$$H_i(G, P) = \lim_{l \to \infty} \frac{E[R_l]}{E[T_l]}.$$
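The renewal ratio can be evaluated directly for a small case. The Python sketch below is an illustration under stated assumptions: it takes the path graph on a, b, c with edges ab and bc and P(a) = P(b) = P(c) = 1/3, so that V_a = {b} and every phrase is a run of a's and c's closed by one b. The function name and the truncation length `n_max` are hypothetical. A phrase with i a's and n−i c's has an interchange class of C(n, i) equally likely strings.

```python
import math


def renewal_interchange_entropy(n_max=150):
    """Approximate E[R] / E[T] for the three-vertex path, uniform source.

    n counts the symbols from {a, c} that precede the closing b, and i counts
    how many of them are a's. n_max is an assumed cutoff for the sum; the
    neglected probability mass is of order (2/3) ** n_max.
    """
    p = 1.0 / 3.0
    expected_r = 0.0  # expected self-information of a phrase's class
    expected_t = 0.0  # expected phrase length in symbols
    for n in range(n_max + 1):
        for i in range(n + 1):
            # P(sigma) times the size of the interchange class of sigma.
            class_prob = math.comb(n, i) * p ** (n + 1)
            expected_r += class_prob * -math.log2(class_prob)
            expected_t += class_prob * (n + 1)
    return expected_r / expected_t


estimate = renewal_interchange_entropy()
```

The estimate should agree, up to truncation error, with the figure of about 1.27645 bits per symbol obtained for the same graph and source in the example that closes the section containing Theorem 2.10.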

Next, consider auxiliary sources (viewed as a countably infinite Markov
chain where the state at any time consists of the suffix of the present designated
representative phrase beginning with the current symbol). While within a phrase, there is a
single possible transition from one state to the next that occurs with probability 1. All of
the uncertainty lies in the transition from the final letter in a phrase to the first state
marking the beginning of the next phrase, and these transitions are independent and
identically distributed. In order to compute the entropy H_a(G, P) of the auxiliary source,
the probability π that the auxiliary source is on the last symbol of a phrase is needed.
Consider a reward process where phrase l receives a reward of 1 unit corresponding to the
last symbol in the phrase. The average reward per symbol is bounded from below by
L(n) / Σ_{l=1}^{L(n)+1} T_l and from above by [L(n) + 1] / Σ_{l=1}^{L(n)} T_l. In the limit as n → ∞ the upper
and lower bounds both approach π = [lim_{l→∞} E[T_l]]^{−1} almost surely. At the last symbol in
a phrase, the probability that the next phrase from the auxiliary source is σ is
P(σ) |{θ : θ ≡_G σ}| if σ is one of the designated strings representing the interchange class
of a phrase and the probability is zero otherwise. It follows from the formula for the entropy of
a unifilar Markov source that H_i(G, P) = H_a(G, P).

Consider the case where the original source is a finite state, unifilar Markov
source and the dependency graph is either a complete k-partite graph or a graph where at
least one vertex is adjacent to all of the others. In this case, the interchange class of the
phrases combined with some information about the state of the original process at the
beginning and end of the phrases forms a countably infinite state, ergodic Markov chain.
The exact states of the original process at the beginning and end of the phrases may not be
needed. For example, in the complete k-partite case, if V_l = i and there are distinct
states which have identical behavior when the source emits elements of V_i, they can be
merged when describing the state preceding phrase l. The process of transforming the
original source into an auxiliary source maintains the information about the phrases and
the correct transition probability from one phrase to the next. Hence, the entropy of the
auxiliary source will be equal to the interchange entropy.

As is known in the art, the methods and apparatus discussed herein may be
distributed as an article of manufacture that itself comprises a computer readable medium
having computer readable code means embodied thereon. The computer readable program
code means is operable, in conjunction with a computer system, to carry out all or some of
the steps to perform the methods or create the apparatuses discussed herein. The computer
readable medium may be a recordable medium (e.g., floppy disks, hard drives, compact
disks, or memory cards) or may be a transmission medium (e.g., a network comprising
fiber-optics, the world-wide web, cables, or a wireless channel using time-division
multiple access, code-division multiple access, or other radio-frequency channel). Any
medium known or developed that can store information suitable for use with a computer
system may be used. The computer-readable code means is any mechanism for allowing a
computer to read instructions and data, such as magnetic variations on a magnetic medium
or height variations on the surface of a compact disk.

The computer systems and servers described herein each contain a memory
that will configure associated processors to implement the methods, steps, and functions
disclosed herein. The memories could be distributed or local and the processors could be
distributed or singular. The memories could be implemented as an electrical, magnetic or
optical memory, or any combination of these or other types of storage devices. Moreover,
the term "memory" should be construed broadly enough to encompass any information
able to be read from or written to an address in the addressable space accessed by an
associated processor. With this definition, information on a network is still within a
memory because the associated processor can retrieve the information from the network.

It is to be understood that the embodiments and variations shown and
described herein are merely illustrative of the principles of this invention and that various
modifications may be implemented by those skilled in the art without departing from the
scope and spirit of the invention.